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Abstract 

Cytotoxic T lymphocytes (CTLs) play an important role in shaping HIV-1 infection. In particular, during the first weeks 
of infection, CTLs select for multiple escape mutations in the infecting HIV population. In recent years, methods have 
been developed that use intra-patient escape mutation data to estimate rates of escape from CTL selection. The resultant 
escape rate estimates have been used to study CTL kill rates and fitness costs associated with particular mutations, thereby 
providing a quantitative framework for exploring CTL response and HIV dynamics using patient datasets. Current methods 
for escape rate inference focus on a specific HIV mutant selected by a single CTL response. However, recent studies have 
shown that during the first weeks of infection, CTL responses occur at 1 — 3 epitopes and HIV escape occurs through 
complex mutation pathways. In this work we develop a model of initial infection, based on the well-known standard model, 
that allows us to model the complex mutation pathways of HIV escape. Under this model, we develop two computational 
inference methods. In one, we use a Bayesian approach to construct posteriors for the parameters of our model. In the 
second, we develop methods for hypothesis testing under our model. The methods are applied to two CHAVI datasets, 
demonstrating the importance of taking into account the interaction of multiple mutant variants and multi-directional 
selection. 



1 Introduction 

Acute HIV-1 infection is marked by a period of approximately 2 — 4 weeks in which the viral population expands from a 
presumed 1 — 5 initial infected cells, in the case of sexual transmission [l], to a population of approximately 10^ infected 
cells [2||5]. An innate immune response occurs within about 5 days of initial infection, but the adaptive response takes 
more time. In particular, cellular response is correlated temporally with the end of the expansion period. i5|-jT . 

Adaptive response during the acute phase of infection includes both a T cell and antibody component, but neutralizing 
antibodies only arise towards the end of the acute phase or later in infection [slfQ]. In contrast, there is strong evidence 



suggesting that CTLs do exert a significant selective force around the time of peak viral load ^10 11 . For example, in the 
four patients considered in [Tl] , initial CTL response targeting a single epitope started at or several days before peak viral 
load and escape mutations at the epitope were fixed or nearly so 2 — 3 weeks thereafter. Shortly after the initial CTL 
response, CTL responses at 1 — 2 other epitopes was seen. Escape mutations at these epitopes were fixed or nearly so 
within roughly 4 — 6 weeks of peak viral load. 

While the description above provides a qualitative picture of initial CTL response and HIV escape, an accompanying 
quantitative description is still being developed. Put in evolutionary terms, HIV escape from CTL response forms an 
example of a selective sweep. A quantitative theory of selective sweeps does exist, e.g. |12||16| , but this theory typically 
builds off generic models that do not fit the biology of acute HIV infection in several ways. First, the HIV population size 
is not fixed during initial infection, instead HIV viral load rises by 2 — 4 logs, peaks, and then drops by 2 — 3 logs during 
the time span of the initial HIV selective sweep [2{ |11[[T7] . Second, since CTL populations change over the time period of 
the sweep, the selective force exerted by CTLs is time varying. Third, since CTLs target multiple epitopes and the fitness 
effects of mutations are complex, the selective force exerted on the HIV population is multi-directional. Fourth, the high 
mutation rate of HIV means not only that many mutant variants will arise, but also that multiple mutation pathways 
leading to escape will exist. As a result, providing a quantitative description of HIV selective sweeps during initial infection 
is a modeling and computational challenge. 

Several authors have examined selection in the context of HIV, e.g. [15||18}f2l] , but not with a focus on making inferences 
on selective sweeps in acute infection and, consequently, not with models that refiect the unique features of CTL response 
and HIV dynamics during acute infection. Techniques have been developed to infer HIV escape rates by focusing on mutant 
frequency at a single epitope fl0l [ll[[22||24| . These techniques have been valuable in quantifying the role of CTL attack in 
shaping HIV escape rates. However, the model used considers a single genome region and assumes only two HIV variants 
at that region, a wild type variant and a mutant variant. As a result, the effect of different HIV variants exposed to 
multi-directional selection is difficult to assess. 
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In this work, we describe a model and associated computational methods through which HIV selective sweeps driven by 
multi-epitope CTL response and multi-variant HIV escape can be analysed. The model is based on the standard model of 
viral dynamics [25]. Since HIV mutants that escape CTL attack do not exist at initial infection and often arise sequentially 
in time, we extend the deterministic standard model to a stochastic birth-death process that includes mutation. Intuitively, 
the birth-death process is an agent-based system that tracks the birth, death, and mutation events of individual cells 
infected by different HIV variants. Such an approach to HIV dynamics has been used previously, e.g. |26||28| . We model 
the possible mutation pathways by what we term an escape graph. The escape graph, formed through information available 
in the datasets considered, specifies the different mutations and resultant HIV variants that are involved in the selective 
sweep and tracked by the birth-death process. 

Given a model, specified through an escape graph and associated birth-death process, we develop two computational 
inference methods. First, assuming a prior distribution on the parameters of the birth-death process, we describe a 
computational approach to forming a posterior parameter distribution. Posteriors describe the most likely parameters 
conditioned on the data, but they do not directly describe the overall fit of the model to the data. So second, we describe 
a computational approach that determines the p-value of a specific escape graph and associated birth-death process. This 
allows for hypothesis testing. 

We apply our computational methods to the datasets of patients CH40 and CH58 presented in 11 29|, focusing our 



analysis on the first 2 — 3 timepoints sampled in [11]. Our methods use data that specifies frequencies of different HIV 
variants at different sampled time points, the type of data available in ^llj29j. For CH58, we focus our analysis on the first 
two regions of the founder genome at which mutations are seen to fix. One region is the epitope targeted by the initial CTL 
response and the other region is possibly associated with fitness effects. Mutations fix at these two regions simultaneously 
and we construct posteriors to analyze how these simultaneous sweeps affect each other. For CH40, we consider escape 



from the initial CTL response. Through deep sequencing data for CH40 presented in 29 , 7 different mutations at the 
epitope targeted are seen to play a part in escape. By exploiting our p-value computations to perform hypothesis tests, 
we show that variations in mutant frequency seen in the CH58 dataset are not necessarily reflective of difference in CTL 
kill rates or fitness difference, but rather may result from the stochasticity inherent in the time at which different mutant 
variants arise. 

Calculating posteriors and p-values under our model is not computationally trivial. The model is stochastic and the 
data is high-dimensional. Standard Monte Carlo approaches in which the birth-death process is simulated without any 
conditioning are not effective because the data represents a single point in a high dimensional space and the birth-death 
process rarely hits any specific point. Further, since the data is high-dimensional, defining what is meant by a p-value 
is not straightforward. We address these issue by exploiting a stochastic-deterministic decomposition of HIV birth-death 
processes introduced and explored by several authors [30 ■ 33 . Using this decomposition, we construct a reduction that 



associates the birth-death process with a stochastically simpler Markov chain. Through this Markov chain reduction, the 
posterior can be computed using a Markov Chain-Monte Carlo approach and the p-value can be defined and computed. 

2 Model and Methods 

Our model is based on the following form of the standard model for which virions are assumed to be in steady state, 25p4 



f{t) = \-dT-kT^h (1) 

V 



where T, /„ represent uninfected target CD4-f T cells and CD4+ T cells infected by HIV variant v, respectively, per fj,L. 
To allow for multiple variants, v varies over whatever set of variants is being considered, k, A, d represent the infection rate 
per target cell per fiL, target cell production rate per fiL, and target cell death rate, respectively, per day. Crucial for our 
model, 5v represent the death rates of cells infected by variant v. Instead of explicitly modeling CTL dynamics, we allow 
the parameters Sv for different variants to vary in time and in this way model the selective force exerted by CTLs or other 
fitness effects implicitly. 

To model the mutation pathways through which HIV variants escape selection, we define an escape graph by specifying 
a set of vertices and a set of directed edges. Vertices correspond to HIV variants that are part of the mutation pathway 
through which the selective sweep occurs. Two vertices, say A and B, may be connected by an edge directed from A to 
B if variant A can mutate into variant B through a single nucleotide substitution. We write A ^ B when A has an edge 
pointing to B. Figure [l] shows an escape graph with three variants : F, Ml, and M12. In the results section, this escape 
graph arises in connection to patient CH58. F represents the original founder HIV variant that infected the patient. The 
movement from variant F to Adl to M12 represents a pathway of HIV escape suggested by the patient data. The specific 
geometry of this graph, which ignores certain variants, refiects a modeling choice. Figure [2] shows the escape graph we use 
for patient CH40. 

A birth-death process describes the dynamics of the variant populations specified by the escape graph vertices and is 
an extension of ([ij). T{t) and Iv{t) represent the same populations as in the standard model, with the caveats that the 
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Figure 1. Escape Graph for CH58 




Figure 2. Escape Graph for CH40 



v's are vertices in the escape graph and the units are now per body rather than per /iL. The change in units aUows us to 
track the rise of new variants which initially only infect a single cell. The parameters of the birth-death process include 
the parameters of ([l]) with two extensions. First, we allow 6v to depend on time, i.e. we consider Sv{t) instead of just 
(5j,. Second, if A and B are connected by an edge we let ^ab be the rate at which A variants mutate into B variants. 
Throughout this work we set all such mutation rates equal to fj, — 3 * 10~^, but the methods allow for any value for any 
edge. The birth-death process is defined through the birth and death rates specified in Table [T] For example, at time 
t a single cell infected by variant v produces child infected cells with rate kT{t), meaning that in a small time interval 
[t,t + At] the probability of a new v variant infected cell arising is roughly kT{t)Iv{t)At. In both cases discussed in the 
Results section, the escape graph includes a founder vertex F. We always start the birth-death process at 'initial infection' 
which we model as t = 0, If{0) = 1 and /„(0) — for v ^ F. 

Table 1. Rates for the Birth-Death Process 



cell type 


birth rate 


death rate 


mutation rate 


T 


A 






Iv 


kT 







Since we do not explicitly model CTL dynamics, the which implicitly model CTL attack are parameters of special 
interest. In computing posteriors, estimating 5v{t) with no restrictions is beyond our current methods. Instead, along with 
the escape graph and birth-death rates we specify a list of attack intervals [0, ti], [ti,t2], . . . , [tn-i,t„] and restrict the S^{t) 
to be constant during any attack interval. For example, we often choose ti = 15 reflecting approximately 15 days before 
CTL response arises and correspondingly choose Sv{t) = .4 for t G [0, 15] which gives a half-life for infected cells falling 
between 1 — 2 days |17[|35| . Posterior construction centers on computing the values of the Sy (t) during the attack intervals. 
In the Results section, k, d, A, fj, are fixed within each patient, although these parameters can certainly be made part of the 
prior and posterior. The specific form of the attack intervals is a modeling choice and is formed by considering CTL data 
available. Start and end times for the attack intervals can be made part of the posterior as is done for patient CH40. 



2.1 Simulation of the Birth-Death Process 



Simulating a birth-death process through a Gillespie algorithm, 36 , is not computationally feasible for an infected cell 
populati on size of the order 10^. Instead we employ the idea of stochastic-deterministic decomposition used by several 
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To make this decomposition precise, for each variant v we define a stochastic interval, \t. 



authors 

the stochastic interval, Iv{t) dynamics are generated according to the following algorithm, 
Pre-Stochastic Interval Step: 



(v) 



start, *cnd]- GivCU 



For t <t 



(v) 
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stochastic Interval Step: 

For t G [istirti *cnd]' Iv{t) dynamics are generated by a Gillespie algorithm using the rates of the birth-death process. 

Post-Stochastic Interval Step: 

For t > Iv{t) dynamics are approximated using the differential equation: 

/„ = kTh - {t)h + M ^ kTh, . (2) 

v' — yv 

The above algorithm can be employed for each variant, but in practice we assume that the founder variant is deterministic 
from f = 0. This approach ignores stochasticity in the first 1 — 2 days of infection, a time period that does not affect our 
results and for which the modeling is uncertain with the advantage of improved computational time. For all other variants, 
since the Gillespie algorithm samples exactly from the birth-death process, the accuracy of the stochastic-deterministic 
decomposition improves as the stochastic interval is widened, but at the cost of greater computation time. The endpoints 
of the stochastic intervals are set through tuning parameters e and L. These two parameters tune the algorithm by implicitly 
selecting a trade-off between computational speed and accuracy. 

To explain how we determine the stochastic interval, we consider the escape graph in Figure [l] Starting with variant 
Ml, since F does not have a stochastic interval, tsffrV is defined by Mfer(tstart )^f'(^start ) — time at which the 

rate of mutations reaches e. The probability of an F to Ml mutation prior to t^J^ll} is order e. The error of ignoring such 
mutations, as is done by the Pre-Stochastic Interval Step, drops as e is decreased, t^nd^' defined by ^Mi(iend^') — L 
where lMi{t) is roughly the average population size of Ml variants. Put precisely, let t* be defined by fiklF{t*)T{t*) = 1, 
then /Mi(i) obeys i2K with initial condition i{t*) — 0. As L is increased, t^'J^H^ rises thereby increasing the duration of the 
stochastic interval and improving accuracy. The stochastic interval of M 12 is generated similarly, but with Ml playing 
the role of F. For a general escape graph, we would start with the founder and work our way outwards along the edges, 
generating a stochastic interval for each vertex. 



In 33 , through simulation the relative error produced by L = 100, e = was found to be approximately .03. Similar 



results, expressed in somewhat different settings, were found in 30 32 . Figure [3|\ shows the probability density of 
-^Afi(15), the number of Ml variant infected cells a.t t — 15, computed using the stochastic-deterministic decomposition 
with L = 2000, e = 0. Parameters were set at = 2 * 10"^, d = .01, A = 10*, and (5„ = .4 for v = F,M1,M12. These 



parameter choices fall within the typical range used for HIV, e.g. rf|[37 39 , with the caveat that A, k must be converted to 



per body units and remembering that we assume virions are in steady state. (For example, the k = 2 * 10~^ in our model 
corresponds to k — 1.2 * 10^^ in the model of [39], approximately the value cited in that paper.) The stochastic interval for 
Ml was [4.6, 13.7], so by t = 15 the Ml variants were in the deterministic portion of the decomposition. Figure [sJB shows 
the difference in the cumulative distribution between using L = 2000 and an exact Gillespie distribution run up to f = 15. 
As can be seen the maximum difference in the cdfs is less than .01. In the Results section, we make the conservative choice 
of L = 2000, although L = 100 is likely sufficient. To simulate the birth-death process, choosing e = is computationally 
feasible, however the Markov chain reduction depends on e > as will be discussed below. 




Figure 3. Accuracy of Stochastic-Determinsitic Decomposition. (A) Distribution of /j\/i(15), the number of Afl 
variants at t ^ 15, with parameters fc = 2.6 * 10"^, d = .01, A = 10*^ and S,j = .4 for v = F, Ml, M12. (B) Difference of 
cumulative distributions for /mi(15) using Gillespie simulation and stochastic-deterministic decomposition with L ~ 2000. 
Both (A) and (B) were generated using 10^ simulations. 
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2.2 Markov Chain Reduction of the Birth-Death Process 



While simulation of the birth-death process is possible using the algorithm of the previous section, inference is difficult. 
To overcome this obstacle, we reduce the stochasticity of the birth-death process to the much simpler stochasticity of a 
Markov chain. The computational approaches used to construct posteriors and p-values depend on this reduction. 

As a first step in obtaining the Markov chain reduction, we modify the simulation algorithm described in the previous 
section. Everything is as before, except that the Stochastic Interval Step is now implemented through the following sub-steps 
which do not use a Gillespie simulation. 



Modified Stochastic Interval Step, sub-step a 

"^cnd 



Set Iv{t) — for all t £ [^tirti ^ind)- During this time interval, store T{t) and /„'(/:) for all v' with an edge pointing 



to V. 

Modified Stochastic Interval Step, sub-step b 

Let B{t) be a single variant birth-death process, i.e. B{t) is a scalar, defined by -B(titart) — and with birth, death, and 
mutation rates of kT{t), 5-uit), and "^^i^^ fJ'kT{t)Iyi{t), respectively. B(t) has the same birth, death, and mutation 
rates as variant v infected cells. Define Xy = B{t^J^^). Since T{t), lyi (t) were stored in sub-step a, the distribution of 
Xy can be computed through standard methods, see [39] for an example in the context of viral dynamics. Briefly, X^ 
is computed by solving a backwards equation for the expected value E[exp[—iLjB{tl^J^)] | B{t) — I]. Then a Fourier 
transform in uj is performed to obtain the distribution of S(t'^^j) or in other words X„. 

Modified Stochastic Interval Step, sub-step c 

Using the distribution computed in sub-step b, produce a sample x„ from X„. Set Ivit^^^) — Xy. 

The above algorithm samples the number of variants at the end of the stochastic interval exactly, under the assumption that 
V variant dynamics during the stochastic interval do not effect T{t) dynamics, i.e. the number of CD4-I- target cells. Since 
a variant v will have population size of order L, for L less than 10000 the v variant population will have a frequency on the 
order of 10~^. Such low frequencies cannot be sampled by even deep sequencing approaches so ignoring the dynamics does 
not limit connection to data. Further, the error produced by ignoring such a low frequency variant in computing birth-death 
process dynamics is dominated by the error produced by the stochastic-deterministic decomposition, the sampling error 
associated with the data, the error produced by solving equations such as ([2j numerically, and the stochasticity of the birth- 
death process. For example, in the context of Figure [s] discussed in the previous section, the difference between T(15) and 
If (15) under the Modified Stochastic Interval Step as opposed to the Stochastic Interval Step is approximately 10~^. Such 
a difference is clearly dominated by the stochasticity of the birth death process and the error of the stochastic-deterministic 
decomposition as shown in Figures [3}\-B. 



Using this new simulation approximation, the stochasticity of the birth-death process is reduced to the variables Iv{t 



end / 



or in other words the draws of the samples from the distributions X„. To see this notice that except for sampling from 
Xv for all variants v ^ F, the simulation of the birth-death process is completely deterministic. We refer to X„ and Xv 
as the pop size distribution and pop size of the v variant population because, intuitively, determines how soon the v 
variant population has significant frequency and hence 'pops up' in the data. 

For appropriate choices of e and L the stochasticity of the birth-death process can be further simplified. For concreteness, 
consider the escape graph of Figure [l] If e is chosen large enough and L is chosen small enough then t^^d^' < *start'^\ 
the stochastic interval of Ml will end before the stochastic interval of M12 starts. In this case, the stochastic interval of 
Ml will be considered first, the distribution of Xmi generated, and xmi sampled. Only once xmi has been sampled will 
the dynamics be run forward to the A/12 stochastic interval. Then, Xm\2 will be generated and xmi2 will be sampled. 
Putting all this together, xmi and xm\2 are states in a Markov chain. Further, the dynamics of the birth-death process 
are deterministic other than the choices of xmi and xmi2- In this sense, the birth-death process is reduced to a Markov 
chain. For more complex escape graphs, the idea is the same with the linear Markov chain of the example being replaced 
by a Markov chain on a general escape graph. The pop size of a vertex v depends on the pop sizes sampled for vertices 
that have an edge pointing to it and the Markov chain begins from the founder vertex and moves outwards. 

In the context of the escape graph in Figure [T] forcing the Ml stochastic interval to end before the M12 stochastic 
interval begins requires, 

Mfcr(t(r))/„,(e/')<.. (3) 

The product kT{t) is between 1 — 2 in early infection and, generally, can be taken near 1. Then recalling that iMiit^,'^^^) 
is sampled from Xmi, we always have /a/i (t^nd^' ) of order L. All this gives the requirement for e, which holds generally, 
that e > fiL. 

Choosing an e > means that mutations of variant v' to v occurring prior to the start of «'s stochastic interval will 
be ignored. Such mutations have a probability of order e of occurring with the exact value depending on fc, typically 
though the probability is slightly less than e. All this means that with probability 1 — e, no such mutations occur and our 
assumption of e > does not affect the simulation of the birth-death process. Importantly, our posteriors and p-values are 
accurate for the subset of the birth-death process realizations that do not have such early mutations. 
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2.3 Computing the Posterior 



Let S{t) be the state at time t of the populations tracked by the birth-death process, e.g. for the escape graph of Figure 
S{t) — {T{t),Ip{t),lMi{f),lMi2{t))- Let 'D(ti,t2) be frequency data for the variants of the escape graph collected at 
time points t\,t2- We choose two time points simply for concreteness, any number of data time points are possible. For 
example, the CH58 data discussed in the Results section includes three time points, see Table |2] Letting 9 represent the 
parameters of the birth-death process for which we want to build a posterior, we can simulate the birth-death process and 
generate samples for S{t\), S{t2). Let T>{S{t\), S{t2)) be the data generated by simulating the birth-death process and then 
simulating the task of sampling sequences. By this we mean that first, through simulation, S'(ti), S{t2) must be sampled 



to establish the exact frequencies of the variants at times ti, t2- Then, since the data produced in 11 comes from sampled 
HIV sequences, hypothetical samples must be drawn at time t\ and t2 to form simulated data. 

Given a prior for 9, i^{9), our goal is to compute a posterior of 9 conditioned on the data. More precisely, we aim to 
compute 

P{9 1 V{S{t^),S{t2))^V{t^M))- (4) 
However, it is easier to compute a posterior for 9, S{ti), S{t2), 

P{9, S{tr), S{t2) 1 V(S{U), S(t2)) = VitiM)), (5) 

and Q can be obtained from ([5| by treating S{t\) and S(t2) as nuisance parameters. 



Bayesian posteriors are often computed through Markov chain Monte Carlo (MCMC) methods, see chapter 7 of 40 
for a review of MCMC theory applied to viral data and [4l] for a general review. In our context, implementing such an 
approach depends on being able to compute the probability, 

P{9, S{h), S{t2) ! V{Siti), S{t2)) = 'D{ti,t2)). (6) 

Once (|6| can be computed, various MCMC methods allow one to sample from the posterior of 9. Specifically, we implement 
a Metropolis-Hastings based MCMC. Such an approach is not affected if instead of computing ([6| we compute 

P{e,s{ti),sit2),v{s{ti),sit2))^f>{h,t2)). (7) 

Indeed, Q is identical to Q up to a constant factor and such a proportional factor has no affect on a Metropolis-Hastings 
MCMC. 

([7| can be expressed as a product of simpler conditional probabilities, 

P{9,S{ti),S{t2),V{S{t-i),S{t2))^Vih,t2)) (8) 
^ PiV{Siti),S{t2)) =V{ti,t2) I S(h),S{t2),9)P{S{h),S{t2) I 9)71(9). 

The factor P{'D(S{ti), S{t2)) = 'D{ti,t2) | S{ti), S(t2), 9) can be interpreted as a sampling probability. That is, conditioned 
on knowing S{ti), S{t2) and hence the frequencies of the variants at times t\,t2, what is the probability of drawing the 
data. Computation of such probabilities is standard, see for example the methods of [23| . However, P{S(t\), S{t2) \ 9), the 
probability of a given system state at t\ , t2 given a parameter choice, is not standard. 

Our approach is to use the Markov chain reduction to replace S{t\),S(t2) by for all variants v ^ F. Since the 
birth-death process under our approximation is completely determined by the pop size samples, the are interchangeable 
with the S{t). To form the posterior, we replace the S{ti), S{t2) values above by the Then the factor P{S{ti), S{t2) | 9) 
becomes P{xv for v F \ 9). This second expression is simply the probability that a Markov chain takes on a certain 
state. Since we know the distribution of each x„, P{xv for v F \ 9) can be computed in a standard manner. 

We use a random walk Metropolis-Hastings algorithm on 9 and the Xv to form the posterior. For instance, to form the 
posterior for patient CH58, 9 is 6 dimensional and the Xv, namely xmi and xmi2, are two dimensional. Our MCMC then 
operates on an 8 dimensional state space. To compute a single step of the MCMC takes approximately .2 seconds on an 
Intel 17-2600 using our C-|— I- implementation. For all results below we generated 1.5 million steps for the Markov chain 
which takes between 2 — 3 days. Various improvements in the C-|— I- implementation, most notably multi-threading, should 
significantly improve computation times. Mixing times for the MCMC are roughly on the order of 5, 000 steps. 



2.4 Hypothesis Testing 

Given an escape graph and a choice of parameters for the birth-death process, our goal is to test the null hypothesis that 
the data is formed by the model. Here we let 9 represent all the parameters of the birth-death process and the underlying 
escape graph, as opposed to the previous section where 9 represented only the parameters included in the posterior. The 
challenge lies in computing a p-value. More precisely, our goal is to compute the p-value of the data, 'D{ti,t2), given 9. 
Since the data is multidimensional, the notion of a p-value is not a priori well-defined. However, as in the case of posterior 
computations, the Markov chain reduction allows for a simplification. 

Using the escape graph of Figure[T]as a concrete case, for a given 9 there will be a pair of pop sizes xmi, xmi2 for which 
P{'D{xmi,xm-i2) ~ D{ti,t2) I 9) is maximized, here we are replacing 'D{S{ti), S{t2)) by 'D{xmi,xmi2) since the pop sizes 
completely determine the dynamics of the birth-death process. Generically, the maximum need not be unique, however 
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this is the case for the models considered in the Results section. (The case of non-unique maximums can be addressed, but 
we do not explore that issue in this paper.) Label the xmi, xmi2 that achieve the maximum as x'lj\^, x'lf\%. Then we can 
assess the p-value of D{ti,t2) by considering where {x'^\^ , x'^\%) falls in the two-dimensional density of Xmi, Xmi2- In 
other words, we use pop sizes that maximize the likelihood of the data to determine the p-value. 

More precisely, let <I?mi and $a/i2 be the cumulative densities of Xa/i and Xmi2 respectively. Then let $mi{xm\^) — 
QMi and "1? A/12 (2:^12) = gA/12. If x^i^ was chosen from the density of Xmi, Qmi would be uniformly distributed on [0, 1]. 
The same holds for qMi2- Further, note that under the null hypothesis qmi and qMi2 are independent because the sampling 
from the the pop size distributions Xmi and Xa/i2 occurs independently even though the distribution XAfi2 depends on 
Xmi- 

We can then define the p-value by computing the probability with which two independent, uniform random numbers 
are more 'extreme' than Qmi, qMi2- More precisely, let f2 be the unit square produced by combining all pairs of numbers 
both falling in [0, 1]. Let f{x) for x G [0, 1] be the shortest distance from x to either or 1, i.e. f{x) — min(a::, 1 — x). Then 
we define the p-value associated with 6 by, 

p-value of e = P{f{m)f{u2) < f {xf^T) fixt^T)) (9) 

where ui,U2 are uniform samples from [0,1]. In words, f{x'lj\'^)f{xM\^) is a measure of how far the pair 1^, Xmi^ is 
from the boundary of and we set the p-value to be the probability that two uniform samples from [0, 1] are closer to the 
boundary under this measure than x'^\^, x'^\^. For example, the pair .5, .5 will have a measure of .5^ from the boundary 
of $1 which is greater than any other point. Hence .5, .5 would have a p-value of 1. In practice, we evaluate H through 
Monte Carlo methods by sampling 10^ pairs of ui,M2 and determining the fraction of times f{ui)f{u2) < /(KA/i^)./(a;A/i'')- 
For escape graphs with more variants, the same idea is employed but in higher dimension. 

This definition of p-value may seem non-intuitive as we are comparing samples from the pop size distributions rather 
than data values. However, we aim to quantify the degree to which the data is unusual for the model. To assess this, we 
map the data to the probability space of the model through the Markov chain reduction. Then, since the Markov chain 
has a simple probabilistic structure, p-values can be defined and computed more readily. 



3 Results 

The data we consider is summarized in Figure 2 of flT', although we also exploit linkage information provided by the sequence 
data not refiected in the figure. In yflj, the time points of the data are measured in days since patient identification. So for 
example, day represents the time of initial identification rather than the day of initial infection. To make this distinction 
clear, we always use 'day x' to mean x days since patient identification and 't = x' to mean x days since initial infection. 

The data we use for CH58 is composed of time points day 9 and day 45 at which 7 and 9 sequences are available as 
well as day data in which the founder variant is homogeneous. The stochasticity associated with such small sample sizes 
is large and overwhelms the stochasticity of the birth-death process that we aim to highlight. For example, in ^23j, the 
escape rate at epitope ENV EL9 (introduced and discussed below) is estimated as .1 but with a 95% confidence interval of 
[.005, .808] (see Figure S3, row ENV 581 in [23]). Since the aim of this paper is to highlight the model, the results presented 
for CH58 below assume that the frequency data is generated from 1000 sequences at each timepoint, a modest value for 
deep sequencing datasets. Our posteriors are then analogous to the .1 value of f23^ with the stochasticity described by the 
posterior coming mainly from the birth-death process but also from the sampling of 1000 sequences per timepoint. For 



CH40, deep sequencing is available in 29 and our results correspond to the level of sampling given in that dataset. 

For both patient CH58 and CH40 we choose L = 2000, making a conservative choice to insure accuracy for the stochastic- 
deterministic decomposition. For patient CH40, since the escape graph contains only edges emanating from the founder, 
we can choose e = without worry of overlapping stochastic intervals. In this case then, our simulation and the resultant 
priors and hypothesis tests apply to all realizations of the birth death process. For patient CH58, we choose e = .06 to 
insure a separation between the Ml and M12 stochastic intervals. As a result we ignore roughly 6% of the birth-death 
process realizations. 

For patients CH58 and CH40 we set k — 2.6 * 10""^ and A: = 3 * 10"'', respectively. For both patients we set d = .01, 
A = 10*. The parameters were chosen to match the time of peak viral loads suggested by the data as well as to fall within 
a biologically reasonable range |17[|37[[38] . Choices for the S death rates and attack intervals are discussed separately for 
each patient. 

3.1 Patient CH58: 

For patient CH58, we consider data collected in the first three time points : day 0, day 9, and day 45. The patient was 
identified in Fiebig stage II and according to viral load data, day corresponds to a time slightly prior to peak viral load 



(see Figure SI in 11 ). We make the rough estimate that t = 20 corresponds to day 0. Other choices in the range t £ [15, 25] 
are reasonable, but the basic results we present do not change. 

In [11], two regions on the founder genome experienced rapid, early escape. The first region contains the ENV EL9 
epitope (see CH58.e in Figure 2 of [ll]) which elicited a "very early dominant but transient" T cell response. The second 
region which we label as ENV 830 corresponding to its location relative to the HXB2 genome (see CH58.g in Figure 2 
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of 11 ) was not associated with a known epitope or experimentally identified T cell response. Three other regions of early 



escapes were found in 11 , but the escapes at these regions came slightly after the escapes at ENV EL9 and ENV 830 and 
were still in their initial stages at day 45. ICS data showed a strong memory CDS T cell response to ENV EL9 at day 21, 
but a very weak response by day 45, see Figure 7 in 11 and Figure S5B in 23 . 

Table [2] shows escape percentages at ENV EL9 and ENV 830 for the three timepoints. The first three columns give, 
from left to right, the percentage of samples that have mutations in ENV EL9 and the ENV 830 region; ENV EL9 only; 
and ENV 830 region only; respectively. The fourth column, corresponds to the percentages found in Figure 2 of flTI and 
gives the percentage of mutants in the ENV EL9 region with the status of the ENV 830 region ig nored. Table [2] suggests 
that the escapes at ENV EL9 and ENV 830 overlap. Our goal in this section is to investigate the escape rates of the sweeps 
at these two genome locations in a way that accounts for their interaction. 



Table 2. CH58 Data 



day 


ENV EL9 and 830 


ENV EL9 only 


ENV 830 only 


ENV EL9 ignoring 830 





0% 


0% 


0% 


0% 


9 


0% 


28% 


0% 


28% 


45 


89% 


0% 


0% 


89% 



The data presented in the previous two paragraphs suggests an escape graph of the form given in [T] with _F, Ml, and 
M12 playing the role of variants of founder type, variants with escape mutations in ENV EL9 alone, and variants with 
mutations in ENV EL9 and the ENV 830 region. We separate the dynamics into three attack intervals, t — to t — 15, 
t — 15 to t = 30, and t — 30 to t = 65, through which we form piecewise constant estimates for Spit), SMi{t) and SMi2{t)- 
The time interval t = to f = 15 is meant to model the period prior to immune system attack. The split of the second and 
third attack intervals at t = 30 allows for the change in T cell attack at ENV EL9 implied by the data. 

Table |3] gives the S parameters of the model, 9 in all. However, (5a/12,i (the death rate during t G [0, 15] of M12 variants) 
is not relevant because there are no M12 variants during that time period. We set Sf,i = .4 following standard estimates 
on HIV infected cell lifetimes. The data does not bound the value of (5a/i,3 from above. Indeed, there are no Ml variants 
in the data at day 45 so there can be no upper bound on the death rate of All variants during the interval between day 9 
and day 45. Consequently, we do not view (5j\/i,3 as adding a full dimension to the posterior. Finally then, the posterior is 
6 dimensional corresponding to the parameters : 5mi,i, Sf,2, Smi,2, (5mi2,2, 5f,3,5mi2,3- 

Table 3. CH58 Parameters 



variant 


6 during [0, 15] 


S during [15, 30] 


5 during [15, 65] 


F 


6f,i 


Sf,2 




Ml 








M12 


^Afl2,l 


Sm12,2 





In terms of our parameters, we define 

escape rate 1 — 5f,2 — 5a/i,2 
escape rate 2 — 5f.3 — 5mi2,3 

which quantify the rate of escape at the two regions. This matches, in terms of our model, the escape rate definition given 



122 and used in 11 and 



Figures |4]A_-C give escape rate 1 posteriors with — .4, .6, 10, respectively. For all other parameters our prior was 

a uniform distribution on [0,2]. The value 5m\,i = 10 is not plausible biologically, but is useful in understanding the 
relationship between 5mi,i and escape rate 1. As 5m\,i rises, the number of ENV EL9 mutants in existence at f = 15, the 
time immune response begins, drops. Biologically, a larger 5mi,i means that the ENV EL9 mutants are less fit. With a 
lower number of mutants at t = 15, the escape rate on [15, 30] needs to be higher in order to fit the data, specifically the 
day 9 frequencies. Since an ENV EL9 mutant is not expected to be more fit than the founder variant, i.e. Smi.i > -4, 
and since Smi.i < 10 is reasonable, Figures |4|\-C show that the posterior of escape rate 1 falls somewhere in the range of 
.35 to .6. Figure |4p gives the posterior of escape rate 1 when using a uniform prior of [.4, .7] on 5mi,i and [.4,2] on all 
other parameters. This prior assumes a fitness cost for all mutants over the founder variant. As can be seen. Figure |4p is 
roughly a combination of Figures |4]A. and|4^ and gives a range for escape rate 1 of .4 to .55. 

Figure [5] gives the posterior for escape rate 2 which falls roughly in the range [.1, .3]. The posterior was generated with 
a prior on Smi,i and 5mi.2 that is uniform on [.4, .7] and [.4, 1], respectively. All other parameters had a uniform prior on 
[0,2]. Altering 5^/1,2 has an analogous effect on escape rate 2 as altering Smi,i has on escape rate 1. A relatively large 
value for Smi,2 means that ENV EL9 mutants will die off more quickly, allowing ENV EL9 + ENV 830 mutants to rise 
more quickly, and in turn reducing escape rate 2. Figure [6] gives a heat map for posterior values of the variables Smi,2 and 
escape rate 2 showing this inverse relationship. 
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0.35 0.40 0.45 0.50 0.55 0.44 0.48 0.52 0.57 0.59 0.61 0.63 0.40 0.45 0.50 0.55 

Escape Rate 1 Escape Rate 1 Escape Rate 1 Escape Rate 1 

Figure 4. Escape Rate 1 Posteriors. Escape rate 1 measures the rate of escape at epitope ENV EL9. Posteriors sliown 
in A-C were generated witli a uniform prior on [0, 2] for all parameters in Table [3| except that Sp.i — -4 and (A) 6mi,i — -4 
(B) 6mi,i = -6 (C) Smi,i ~ 10. The posterior in D was generated using a uniform prior of [.4, 2] for all parameters, except 
Sp^i = .4 and Smi.i was assigned a uniform prior on [.4, .7]. All posteriors were generated by running an MCMC for 1.5 * 10^ 
steps. 

Table |4] gives a set of parameter values wliicli we label as parameter set a. We also consider parameter set b, which we 
take identical to set a except with 5f.2 ~ 1-05 rather than 5^,2 = l-l- Escape rate 1 for parameter sets a and b is .5 and 
.45 respectively, both parameters sets have an escape rate 2 of .2. Table [5] gives the pop sizes and p-values for these two 
parameter sets. The pop sizes shown in the table are those associated with the p-value, i.e. the pop sizes that maximize 
the likelihood of the data. Connecting to Figure [4|3 (since 5jv/i,i = -6 in both sets) and Figure [s] parameter set a reflects 
escape rates in the middle of the posterior while parameter set b has an escape rate 1 on the left edge of the posterior. 
Correspondingly, the overall p-value drops from .42 for set a to .02 for set b. 

Table 4. Parameter Set a 



variant 


S during [0, 15] 


i5 during [15, 30] 


S during [15, 65] 


F 


Sf,i = -4 


Sf.2 = 1.1 


Sf,3 = .9 


Ml 




Smi,2 — -6 


= 1 


M12 




^M12,2 = -4 


Smi2,3 ~ -7 



Intuitively, when the birth-death process is defined by parameter set 6, a rare event has to occur to generate the CH58 
data. On the other hand, the CH58 data is 'typical' for the birth-death process defined by parameter set a. The nature 
of such rare or typical events are understood through the pop size distribution used to define the p-values. Focusing on 
parameter set a for a moment. Figure [7]4_ shows the Ml pop size distribution, Xmi- More explicitly, the stochastic interval 
of Ml variants is [6.9, 14.2] and so the pop size distribution is precisely the distribution of 7mi(14.2). Figure[7|3 shows the 
likelihood of the data given different values for the Ml pop size, xmi- The maximum likelihood is achieved at xmi — 1930. 
Returning to Figure[7]4., a pop size of 1930 has a p-value of .7 as noted in Table[5] Figures[7p,D shows the same graphs but 
for the M12 pop size with the Ml pop size fixed at 1930. Figure [S] shows the same graphs as Figure [7| but for parameter 
set b. Notice from Figure [8|3 that in this case the maximum likelihood of the Ml pop size occurs at 5330 which is far to 
the right on the distribution of Ml pop size as shown in Figure [8j\. This results in the p-value of .006 noted in Table [5] 
Combining the p-values from the Ml and M12 pop sizes gives the overall p-value. 



Table 5. Pop Sizes and p- Values for Parameter Sets a and b 





Xmi 


XM12 


Xmi p-value 


XM12 p-value 


overall p-value 


a 


1930 


4570 


.7 


.12 


.42 


b 


5330 


4150 


.006 


.12 


.02 



Figures [9]4.-C show the dynamics produced by parameter set a with pop size values as given in Table [5] Figures |9]\,B 
give population sizes in units of 10® infected cells while Figure gives population frequencies. There is about a three 
log difference shown in graph Figure |9j\,B between the time of peak viral load at t = 33 and t — 65, this is similar to the 
change in viral load seen in patient CH58. The variant percentages shown in Figure [9p at days 9 and 45 are within 1% of 
those given in Table [2] 
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Figure 5. Escape Rate 2 Posterior. Escape rate 2 measures the rate of escape in the region including ENV 830. The 
posterior was generated with a uniform prior on [0, 2] for all parameters in Table [s] except Sp i — .4 and Smi,i, Smi,2 were 
assigned uniform priors on [.4, .7] and [.4, 1], respectively. See Figure [4] for all other parameter values. The posterior was 
generated by running an MCMC for 1.5 * 10^ steps. 



Figure |9p shows that the escape dynamics associated witli parameter set a can be decomposed into two sweeps, one by 
Ml, i.e. ENV EL9 mutants, and one by M12, i.e. ENV EL9 + ENV 830 mutants. As can be seen from the graph the ENV 
EL9 + ENV 830 mutant sweep does not permit the ENV EL9 sweep to finish. We can give an ad-hoc biological explanation 
for the values of parameter set a. The ENV EL9 mutation comes at an absolute fitness cost of .2, i.e. the difference between 
(5mi,i and (5_f,i- From t = 15 to t = 30, CTL attack at ENV EL9 is enough to make up for this fitness cost and ENV 
EL9 mutants begin to push out founder variants. From t = 30 to t — 65 the force of CTL attack at ENV EL9 drops, as 
suggested by the CTL data and attack at other epitopes begins. Since the CTL attack is no longer focused on ENV EL9, 
the fitness cost of ENV EL9 mutants causes them to be selected against with respect to founder variants. However, during 
this period ENV EL9 + ENV 830 mutants rise up. ENV 830 region mutations are possibly compensatory mutation, and so 
ENV EL9 + ENV 830 mutants are not exposed to the remaining CTL attack at ENV EL9 and do not pay the fitness price 
of ENV EL9 mutants. As a result, ENV EL9 + ENV 830 mutants begin to sweep through the population. We emphasize 
that this explanation is simply meant to give some intuition to the parameter values and the resulting dynamics, we do 
not suggest that our results provide conclusive evidence for this biological interpretation. 

Finally, we note that when the birth-death process is run under parameter set a without conditioning on the data the 
dynamics will rarely fit the data. Figure [lO|\ shows the distribution of the frequency of founder variants at day 9. CH58 
data has the founder variant frequency of 72% at day 9. This value indeed has a significant density in the distribution 
shown, hence the high p- value of parameter set a, but the birth-death process will typically not generate a frequency of 72%. 
Parameter set b produces a similar density, seen in Figure [TO^B, but with .72 located in the tail of the density corresponding 
to the low p-value of parameter set b. 



3.2 Patient CH40: 

For patient CH40, early escapes were identified at two epitopes 
(CHlO.b in 
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: NEF SR9 (CHlO.t in Figure 2 in [TT]) and GAG CR9 
16, 45. NEF SR9 mutants are present in the data by day 



The first three timepoints sampled are day 
16 while GAG CR9 mutants are not present until day 45. With this in mind, we restrict our attention to NEF SR9 and the 
days and 16 allowing us to exploit the deep sequencing data for NEF SR9 described in ^29^. At day 0, the data in f29l is 
essentially homogeneous for NEF SR9. Table |6] gives the different NEF SR9 mutants found at day 16 and is produced from 
data in Figure 5 of p9]. More precisely, we considered all variants in the figure with frequency greater than 1% at day 16. 



10 



0.5 



0.6 



0.7 



0.8 



Sm1,2 

Figure 6. Posterior of escape rate 2 and Smi,2- See figure [H] for priors and |4] for all other parameter values. 

For those variants, frequencies were rounded to the nearest tenth and rescaled (by a factor of 1.033) so that frequencies 
summed to 100%. Each NEF SR9 mutant is associated with a one letter label as specified in the table. 



Table 6. CH40 Data at day 16 



label 


a. a. sequence 


frequency at day 16 


F 


SLAFRHVAR 


50% 


Q 


Q 


4.7% 


H 


— H — 


24.9% 


N 


N 


1.9% 


R 


R 


3.8% 


M 


M 


5.6% 


E 


E- 


2.0% 


C 


— C— 


5.7% 


I 


I 


1.4% 



The striking feature of Table |6] is the high frequency of H variants with respect to the other mutant variants. This 
deviation may be explained by differences in fitness, mutation rate, pMHC binding, or CTL recognition. But before turning 
to such explanations, it is valuable to consider whether the deviation could be caused by stochastic effects alone. 

To address this issue, we consider different null hypotheses assuming all mutant variants to be identical, meaning under 
our model that their birth, death, and mutation rates are equal. Figure [2] shows the escape graph we assume. We separate 
the dynamics into two attack intervals, t = Q to t = Ia, t = t a to t = id (day 16 corresponds to t = 36). Ia is the time at 
which we assume CTL attack on NEF SR9 to begin and we consider Ia between t = 12 and t = 19. Table [7] describes the 
death rates assumed by the null hypotheses. All variants, including the founder, have 5 = A for t £ [0, Ja]. For t G [4^,36], 
all mutants share the same death rate denoted 5m while the founder death rate is denoted 5f- In total we have three 
parameters that identify different null hypotheses, tA,5F, 5m- But all null hypotheses parametrized in this manner assume 
identical mutants. 

Figure [Tl] provides the p- value for different values of Ia with 5m = .4, 5f — .56. The results are similar in trend with 
respect to Ia for all 5m, 5f, but with lower p-values. The maximum p value of .011 occurs at tA = 14. The p-value is 
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Figure 7. Parameter Set a Pop Size Distribution and Likelihood. A-D are generated using parameter set a with 
k — 2.6 * 10^'^ , d — .01, A = 10*. (A) Distribution of Xmi, the number of Ml variants at the end of the Ml stochastic 
interval. (B) The likelihood of the data given different values for xmi- xmi2 is chosen to maximize the likelihood given a 
value of Xmi- (C) Distribution of Xmi2, the number of M12 variants at the end of the M12 stochastic interval. (D) The 
likelihood of the data given different values for xmi2 with xmi = 1930. 
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Figure 8. Parameter Set h Pop Size Distribution and Likelihood. A-D are generated using parameter set h with 
k — 2.6 * 10^^ , d — .01, A = 10*. (A) Distribution of Xmi, the number of Ml variants at the end of the Ml stochastic 
interval. (B) The likelihood of the data given different values for xmi- xmi2 is chosen to maximize the likelihood given a 
value of Xmi- (C) Distribution of Xmi2, the number of M12 variants at the end of the M12 stochastic interval. (D) The 
likelihood of the data given different values for xmi2 with xmi = 5330. 



small, although not exceptionally extreme. Nevertheless, one might reject this null hypothesis, especially since values of 
tA 7^ 14 generate p-values much lower than .01. Before continuing, we point out that the low p-value is a result of the many 
variants revealed by deep sequencing. For example, if all the mutants are grouped into a single mutant with a mutation 
rate of 8/i and the parameters set at tA ~ 15, Sm ~ -4 then Figure [12] shows the p-value as we consider hypotheses with 
varying 5p. At Sp = .62 a p-value of .81 is obtained. 

Having rejected the null hypothesis that all mutant variants are identical, a natural second hypothesis would be that 
all 7 mutant variants other than H are identical. We can test this hypothesis using escape graph [2] and the parameters of 
Table [t] except that we add a parameter as the death rate of H mutants during [tA,36]. Figure [TsjA shows p-values 
under Sh ~ .4, 5m = .52, 5f = .67 for hypotheses with different tA- The maximum p-value is .11 achieved as t — 12.5. 
Figure |13P is generated with identical parameters, except that Sp ~ -7. The maximum p-value is .05 which is achieved 
at the relatively late time of tA ~ 17. Figures [T3| \,B reflect a general trend, as the death rates on the founder and non-H 
mutant variants are raised, the attack time must be increased to provide significant p-values. Roughly, raising death rates 
provides a greater advantage to H variants that can be offset by shortening the duration of attack. 

While the preceding discussion provides some statistical perspective on early escape in patient CH40, we hasten to add 
that day 45 data may well change the picture. Indeed, a failure to reject the null hypothesis is not proof of its truth. 
At day 45, Q variants come to dominate the population while H variants drop to low levels (see [29] for more details). 
One explanation for these dynamics is that Q variants are more fit than H variants but are exposed to stronger selection 
by CTLs targeting NEF SR9, similar in some ways to the dynamics seen in patient CH58. Given these observations, it 
may be appropriate to remove the assumption in our null hypotheses that all mutant variants are equally fit prior to CTL 



12 






20 30 40 50 60 
Days Since Initial Infection 



40 45 50 55 60 65 
Days Since initial Infection 



20 30 40 50 60 
Days Since initial Infection 



Figure 9. Dynamics under Parameter Set a. Dynamics shown are formed using parameter set a witli xmi and xm 
given in Tablejsj (A),(B) population sizes of t lie three variants in units of 10^ infected cells. (C) population frequencies of 
the three variants. 
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Figure 10. Founder Variant Frequency Distribution at day 9. (A) parameter set a (B) parameter set b. 

response. However, one must also consider the CTL response at GAG and the lack of data linking GAG to NEF for CH40 
complicates the analysis. 



4 Discussion 

HIV escape from CTL attack during acute infection is a remarkable example of evolution. In the time span of approximately 
2 months the infecting HIV population is targeted by CTL response at 1 — 3 epitopes, experiences massive changes 
in population size, and escapes CTL selective pressure at the targeted epitopes through multiple mutation pathways 
[5)[l0)[ll]|29]. The resulting HIV selective sweeps are complex in two ways. 

suggests, many variants are involved in an 
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First, HIV selective sweeps are high-dimensional. As the data collected in 
HIV selective sweep and each such variant represents a potential dimension in a model system. Second, HIV selective sweeps 
are likely stochastic. Escape variants must initially have small population sizes which are affected by the stochasticity of 
life cycle and mutation events. This early stochasticity of escape variants can substantially affect HIV selective sweeps as 
demonstrated by our model and results. Indeed, Figure [lO] shows that significant variation in founder frequency soon after 
peak viral load can be produced by the early stochasticity of escape variant dynamics. 

Corresponding to the complexity of HIV selective sweeps, recent studies have provided high-dimensional datasets that 
track multiple HIV variants through multiple time points, e.g. 10 11,29 . In this work we have presented a model of HIV 
dynamics during acute infection that can incorporate multiple escape variants, through the escape graph, in a stochastic 
setting, through the birth-death process. Under this model, we have developed inference methods that allow us to exploit 
high-dimensional datasets. 

The results presented for patient GH58 highlight the need to consider multiple variants at once. The early escape 
dynamics for patient CH58 involved escape in two regions of the genome. The two escapes occurred simultaneously and, 
according to our inference results, affected each other. To infer such dynamics, high dimensional models that include 
multiple variants must be considered. 

The results presented for patient CH40 highlight the utility of null models in analyzing HIV escape. We know that 
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Table 7. CH40 Parameters 



variant 


6 during [0, t^] 


5 during [t^ , 36] 


F 


.4 




mutants 


.4 


hi 



nd 



D 



12 12.5 13 13.5 14 14.5 15 15.5 16 16.5 17 17.5 18 



18.5 19 19.5 



Figure 11. Null Hypotheses assuming all mutant variants are identical. Ia models the time of immune system 
response in days since infection. 



HIV escape occurs through multiple mutation pathways. Different mutations will likely have different fitness costs and 
differing levels of escape from MHC presentation and CTL killing. Multiple variants requiring multiple parameters leads 
to significant problems related to inference and over-fitting. Null models can serve a purpose in this context by shifting the 
question from identification of numerous parameters to identification of variants that are in some way unique and should 
potentially be the object of further analysis. 

While we have presented a basic framework for computational inference of HIV selective sweeps, there are features of 



our approach that require more work. For example, we have not considered the issue of parameter identifiability, see 42 for 
a review. As the number of parameters grows, understanding which combination of parameters can be inferred through a 
posterior becomes essential. The computations we have presented occurred on state spaces of modest dimension. Limiting 
our analysis to the initial CTL response allowed for this, but future work should extend to the first wave of CTL responses 
that target 1 — 3 epitopes. While our methods should apply to this greater context, understanding how to construct escape 
graphs and birth-death processes that can extract biologically useful information in this more computationally challenging 
setting requires further work. 
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Figure 12. Null Hypotheses assuming a single mutant variant. tA models the time of immune system response in 
days since infection. 
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Figure 13. Null Hypotheses assuming all mutant variants are identical except for variant H. tA models the 
time of immune system response in days since infection. 
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